Load and Explore the data

Let’s find the variable names.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Let’s take a look at the variable values.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Let me convert the units from g/dm^3 to mg/dm^3 where applicable.
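A minimal sketch of that conversion, assuming the data frame is called `wines` (the name that appears in the `cor.test` output). The toy rows are the first three values of two columns from the `str` output above, and the choice of columns in `g_cols` is an assumption, not the author's actual list.

```r
# Toy data frame: first three rows of two columns from the str() output.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8),
  chlorides     = c(0.076, 0.098, 0.092)
)

# Assumed (hypothetical) selection of columns recorded in g/dm^3.
g_cols <- c("fixed.acidity", "chlorides")

# 1 g = 1000 mg, so multiplying by 1000 gives mg/dm^3.
wines[g_cols] <- wines[g_cols] * 1000

print(wines)
```

Keeping the column names in one vector makes it easy to see, and later audit, which variables were rescaled.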

I am going to create a few bucket variables for use in the multivariate plots. First I will create a bucket variable for alcohol.

## [1] "(8,9]"   "(9,10]"  "(10,11]" "(11,12]" "(12,13]" "(13,15]"
## 
##   (8,9]  (9,10] (10,11] (11,12] (12,13] (13,15] 
##      37     710     444     267     118      23

As you can see, there are very few data points in the first and last buckets, so let’s regroup the data into fewer, wider buckets.

## [1] "(8,10]"  "(10,11]" "(11,12]" "(12,15]"
## 
##  (8,10] (10,11] (11,12] (12,15] 
##     747     444     267     141
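The two groupings above can be reproduced in outline with base R's `cut`. This is only a sketch: the `alcohol` vector is made up for illustration, and only the break points are taken from the bucket labels printed above.

```r
# Hypothetical alcohol values (% by volume) standing in for wines$alcohol.
alcohol <- c(8.4, 9.4, 9.8, 10.0, 10.5, 11.2, 12.5, 13.4, 14.9)

# First attempt: one bucket per percentage point, with a wider top bucket.
bucket1 <- cut(alcohol, breaks = c(8, 9, 10, 11, 12, 13, 15))

# Coarser grouping: merge the sparse buckets at both ends.
bucket2 <- cut(alcohol, breaks = c(8, 10, 11, 12, 15))

# Intervals are right-closed by default, e.g. "(8,10]".
print(levels(bucket2))
print(table(bucket2))
```

By default `cut` returns an unordered factor; passing `ordered_result = TRUE` would make the buckets an ordered factor, if that is what the later plots need.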

Do the same for pH.

## [1] "(2.5,3]"   "(3,3.2]"   "(3.2,3.4]" "(3.4,3.6]" "(3.6,4.1]"
## 
##   (2.5,3]   (3,3.2] (3.2,3.4] (3.4,3.6] (3.6,4.1] 
##        35       353       824       339        48

Let us take a look at the summary of the data

##        X          fixed.acidity   volatile.acidity  citric.acid  
##  Min.   :   1.0   Min.   : 4600   Min.   : 120.0   Min.   :   0  
##  1st Qu.: 400.5   1st Qu.: 7100   1st Qu.: 390.0   1st Qu.:  90  
##  Median : 800.0   Median : 7900   Median : 520.0   Median : 260  
##  Mean   : 800.0   Mean   : 8320   Mean   : 527.8   Mean   : 271  
##  3rd Qu.:1199.5   3rd Qu.: 9200   3rd Qu.: 640.0   3rd Qu.: 420  
##  Max.   :1599.0   Max.   :15900   Max.   :1580.0   Max.   :1000  
##  residual.sugar    chlorides      free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :  900   Min.   : 12.00   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.: 1900   1st Qu.: 70.00   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median : 2200   Median : 79.00   Median :14.00       Median : 38.00      
##  Mean   : 2539   Mean   : 87.47   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.: 2600   3rd Qu.: 90.00   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :15500   Max.   :611.00   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   : 990.1   Min.   :2.740   Min.   : 330.0   Min.   : 8.40  
##  1st Qu.: 995.6   1st Qu.:3.210   1st Qu.: 550.0   1st Qu.: 9.50  
##  Median : 996.8   Median :3.310   Median : 620.0   Median :10.20  
##  Mean   : 996.7   Mean   :3.311   Mean   : 658.1   Mean   :10.42  
##  3rd Qu.: 997.8   3rd Qu.:3.400   3rd Qu.: 730.0   3rd Qu.:11.10  
##  Max.   :1003.7   Max.   :4.010   Max.   :2000.0   Max.   :14.90  
##     quality      alcohol.bucket alcohol.bucket2     pH.bucket  
##  Min.   :3.000   (8,9]  : 37    (8,10] :747     (2.5,3]  : 35  
##  1st Qu.:5.000   (9,10] :710    (10,11]:444     (3,3.2]  :353  
##  Median :6.000   (10,11]:444    (11,12]:267     (3.2,3.4]:824  
##  Mean   :5.636   (11,12]:267    (12,15]:141     (3.4,3.6]:339  
##  3rd Qu.:6.000   (12,13]:118                    (3.6,4.1]: 48  
##  Max.   :8.000   (13,15]: 23

Univariate Plots

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Most wines are of medium quality (a rating of 5 or 6). Only 18 wines have the top rating of 8, and only 10 are at the bottom of the observed scale with a rating of 3.

The distribution has a long tail. Let us limit the range of the x-axis so that the plot is centered on the bulk of the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     900    1900    2200    2539    2600   15500

Most wines have a residual sugar level between 1800 and 2500 mg/dm^3. The distribution has a long tail so I want to transform it into a log scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     900    1900    2200    2539    2600   15500

A peak is now more clearly visible.
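To illustrate why a log scale helps with a long right tail, here is a small sketch that uses only the residual sugar summary values printed above (minimum, quartiles and maximum). The original plotting code is not shown in this report; in ggplot2 the same effect is commonly obtained with `scale_x_log10()`, but that is an assumption about how the plot was made.

```r
# Min, 1st quartile, median, 3rd quartile and max of residual sugar (mg/dm^3),
# copied from the summary output above.
sugar <- c(900, 1900, 2200, 2600, 15500)

# On the raw scale the maximum sits very far from the bulk of the data.
raw_gaps <- diff(sugar)

# log10 compresses the long right tail.
log_sugar <- log10(sugar)
log_gaps  <- diff(log_sugar)

print(round(log_sugar, 2))

# Ratio of largest to smallest gap, before and after the transform.
print(max(raw_gaps) / min(raw_gaps))
print(max(log_gaps) / min(log_gaps))
```

The gap ratio shrinks after the transform, which is why the central peak takes up more of the plot on a log axis.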

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Most wines have an alcohol level of around 10%-12%. The distribution has a long tail, which I trimmed in order to see the distribution better.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Most wines have a total sulfur dioxide level between 17 and 47 mg/dm^3. The maximum value is 289 mg/dm^3, but the 3rd quartile is only 62. This distribution also has a long tail, which I transformed to a log scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   70.00   79.00   87.47   90.00  611.00

Most wines have a chlorides level between 65 and 90 mg/dm^3. The maximum value is 611 mg/dm^3, but the 3rd quartile is only 90. This distribution also has a long tail, which I trimmed in order to see the distribution better.

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations in the dataset with 13 variables: a row index (X) and 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). All the variables are numeric.

Wine quality is rated on a scale of 1-10 (with 1 being the worst and 10 being the best). However, this dataset only has data for wine quality between 3 and 8. Most of the data is for wines with a quality rating of 5 or 6.

What is/are the main feature(s) of interest in your dataset?

The main feature we are interested in is quality. We would like to determine if the quality of the wine is influenced by any of the chemical properties of the wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Given the description of the features and the data shown in the univariate plots section, I think alcohol, density, pH, citric acid and chlorides might play a role in the quality of the wine.

Did you create any new variables from existing variables in the dataset?

I created a couple of ordered factors from the alcohol and pH features, which I believe will be useful when comparing these features against quality along with other features.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Many of the fields had a unit of g/dm^3 while some of them had a unit of mg/dm^3. I changed the values for fields in g/dm^3 to mg/dm^3 by multiplying the value by 1000, so we would be consistent in the terminology.

Bivariate Plots Section

## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## 
## The following object is masked from 'package:scales':
## 
##     percent
## 
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## 
## The following objects are masked from 'package:base':
## 
##     as.array, trimws

Let’s look at the correlation for a couple of other variables.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Pearson's product-moment correlation
## 
## data:  wines$total.sulfur.dioxide and wines$chlorides
## t = 1.8964, df = 1597, p-value = 0.05809
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.001624446  0.096198080
## sample estimates:
##        cor 
## 0.04740047
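Output of this form comes from R's `cor.test`, which runs a Pearson test by default. A minimal sketch on synthetic data follows; the vectors `x` and `y` are made up and are not the wine data.

```r
set.seed(1)

# Synthetic stand-ins for two numeric columns (hypothetical, not wine data).
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

# Pearson's product-moment correlation test (the default method).
res <- cor.test(x, y)

# The pieces reported in the printed output above.
print(res$estimate)   # sample correlation
print(res$conf.int)   # 95 percent confidence interval
print(res$p.value)    # p-value for H0: true correlation is 0
print(res$parameter)  # degrees of freedom, n - 2
```

Reading the estimate together with its confidence interval is more informative than the p-value alone: with 1599 observations, even a weak correlation can be statistically significant.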

Let us explore the correlation between quality and citric acid

On average, the higher the citric acid content, the better the wine quality.

Let us now look at alcohol and wine quality

On average, the higher the alcohol content, the better the wine quality.
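An "on average" comparison of this kind can be sketched with `tapply`, which computes a statistic per group. The rows below are the first ten `quality` and `alcohol` values from the `str` output; ten rows are far too few to show the trend itself, so this only illustrates the mechanics.

```r
# First ten rows of quality and alcohol, copied from the str() output.
wines <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Mean alcohol content for each quality rating.
means <- tapply(wines$alcohol, wines$quality, mean)

print(means)
```

Swapping `mean` for `median` gives a per-group summary that is less sensitive to the long tails seen in several of these variables.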

Let us now look at density and wine quality

Density has an inverse correlation to the wine quality - the higher quality wines have lower density on average.

There is a negative correlation between average volatile acidity and quality.

There is a very slight negative correlation between average pH and wine quality.

There is a small negative correlation between wine quality and the average chlorides.

There is no clear correlation between total sulfur dioxide and the quality of the wine.

There is no clear correlation between free sulfur dioxide and the quality of the wine. Wines with lower free sulfur dioxide may be classified as either good or bad wines.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality correlates most clearly with alcohol and citric acid (positively) and with density (negatively). It also has a negative correlation with volatile acidity, pH and chlorides.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile acidity and pH seem to be correlated, which makes sense since pH is a measure of the level of acidity.

What was the strongest relationship you found?

The strongest relationships seem to be between alcohol and quality and between citric acid and quality. It would be interesting to put these together in a multivariate plot and see how it plays out.

Multivariate Plots Section

Let us compare fixed acidity against density, as the cor.test indicates some correlation.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

While there is a small correlation between fixed acidity and density of wines, the quality of the wine does not seem to be correlated to these variables.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Again, although fixed acidity and citric acid seem to be correlated with each other, I found no correlation between quality and this pair of variables.

Let us explore the correlation between quality and citric acid some more

The above graph shows that the quality of wine is better with higher levels of citric acid and alcohol. However, if the alcohol level is above 13%, then the citric acid levels for a good wine become more variable. Also, if the alcohol level is below 9%, then wine quality is mostly poor to medium.
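The kind of breakdown behind such a plot can be sketched as a two-way table of mean quality by alcohol bucket and citric acid level. Everything in this example is hypothetical: the rows are invented, and the citric acid cut point of 250 mg/dm^3 is an arbitrary choice for illustration. Only the alcohol breaks match the buckets defined earlier.

```r
# Hypothetical rows; not drawn from the wine data.
toy <- data.frame(
  alcohol     = c(9.5, 9.8, 10.5, 10.8, 11.5, 11.7, 12.5, 13.0),
  citric.acid = c(100, 400, 150, 450, 200, 500, 100, 600),  # mg/dm^3
  quality     = c(5, 5, 5, 6, 6, 7, 6, 8)
)

# Alcohol buckets with the same breaks as the coarser grouping above.
toy$alcohol.bucket <- cut(toy$alcohol, breaks = c(8, 10, 11, 12, 15))

# Arbitrary two-level split of citric acid (assumed cut point: 250).
toy$citric.bucket <- cut(toy$citric.acid, breaks = c(-1, 250, 1000),
                         labels = c("low", "high"))

# Mean quality for every combination of the two buckets.
mean_quality <- tapply(toy$quality,
                       list(toy$alcohol.bucket, toy$citric.bucket),
                       mean)

print(mean_quality)
```

Reading across a row shows the effect of citric acid at a fixed alcohol level, and reading down a column shows the effect of alcohol at a fixed citric acid level.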

Density and alcohol taken together don’t seem to show a clear relationship with the quality of wine.

There is a strong negative correlation between volatile acidity and pH and the quality of the wine. For pH between 3.0 and 3.6, a lower volatile acidity level makes for a higher quality wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   70.00   79.00   87.47   90.00  611.00

Trim the values for chlorides above 150 mg/dm^3, as most of the data is for chlorides below 150.
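A minimal sketch of such a trim, using a made-up `chlorides` vector (the values are hypothetical, chosen to span the range in the summary above). Checking the share of rows that survive the cut-off is a quick way to confirm that the trim removes only the tail.

```r
# Hypothetical chlorides values in mg/dm^3; not the actual column.
chlorides <- c(12, 70, 79, 90, 120, 200, 611)

# Keep only observations at or below the 150 mg/dm^3 cut-off.
trimmed <- chlorides[chlorides <= 150]

# Fraction of observations retained after trimming.
kept <- mean(chlorides <= 150)

print(trimmed)
print(kept)
```

If the plot is drawn with ggplot2, restricting the axis range instead of dropping rows is another option, though the report does not show which approach was used.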

Again we see that there is a strong correlation between chlorides, pH and the quality of wine. As pH increases, lower levels of chlorides make for better wines. However, when the pH level is higher than 3.6, the chloride levels seem a little higher. The dashed line shows how quality varies with the median value of chlorides.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When I mapped citric.acid against quality and broke it down by alcohol, a significant correlation emerged from that plot. Also, I found that the correlation between quality and chlorides becomes more pronounced when we factor in pH.

Were there any interesting or surprising interactions between features?

While there are some inconsistencies in the plot for chlorides, quality and pH, we need to remember that most of the data in this data set is for quality around 5-6, so we might not have sufficient information to clearly find the correlation.

Final Plots and Summary

Plot One

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Description 1

Most wines are of medium (5-6) quality with only 18 in the best (8) quality and 10 in the worst (3) quality. This might skew the analysis of the data towards parameters that make for mediocre wine, since we don’t have sufficient evidence for what makes good wine or bad wine.

Plot Two

## 
##  Pearson's product-moment correlation
## 
## data:  wines$alcohol and wines$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  wines$citric.acid and wines$alcohol
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06121189 0.15807276
## sample estimates:
##       cor 
## 0.1099032

Description 2

Alcohol seems to be correlated with quality, with a Pearson’s correlation of nearly .48 - this was the highest correlation I found between the quality of the wine and its chemical components. I also noted that citric acid had a notable correlation with the quality of wine as well - nearly .23. Let us combine these two variables and compare them to the quality of the wine.

Plot Three

## [1] "(8,10]"  "(10,11]" "(11,12]" "(12,15]"

Description 3

When alcohol is added to the plot, the correlation becomes clearly evident. It would appear that as the alcohol content of the wine increases, more citric acid contributes to a better quality of wine. However, at a higher concentration of alcohol, the citric acid levels for a good quality wine become more variable.

Reflection

The data set was small and skewed towards the medium quality wines. When using bivariate plots, it was hard to find any correlation between the variables, although the ggpairs plot showed strong correlation between citric acid, alcohol and pH. I then mapped citric acid and alcohol to the quality of wine and found a strong correlation there. The plot showed that at higher alcohol levels, more citric acid appears to improve the quality of the wine. I also found a weak negative correlation between the level of chlorides, the pH and the quality of wines.

I wonder if having more data in the wines data set would change the results of this analysis. Since the quality of wines is determined by individual tasters, and given that there are several factors that play into the individual’s rating of the wine (including smell and taste which are hard to quantify), I wonder whether it can always be consistent with the chemical composition of the wine.

The above analysis has attempted to bring some correlation to wine quality and its chemical composition, but I don’t believe we have enough evidence to come up with a model for wine quality yet.